[OSM-dev] osm2pgsql: diffs involving relations now work also

Jon Burgess jburgess777 at googlemail.com
Wed Jul 16 00:39:49 BST 2008


On Tue, 2008-07-15 at 18:57 +0100, Jon Burgess wrote:
> On Fri, 2008-07-11 at 15:59 +0200, Martijn van Oosterhout wrote:
> > I've just committed the necessary changes to make diffs that change
> > ways properly reconstruct the relations that refer to it. That just
> > leaves the parking node problem which hasn't got a nice solution yet.
> > I don't expect it to be a problem in the short term though.
> > 
> > What this code needs now is lots of testing. To run this on a full
> > planet dump you will need on the order of 40GB of disk space, possibly
> > more (if anyone gets it to run to completion, could you post the disk
> > used, thanks). Note it will probably take quite a while unless you
> > have more than 1GB of memory and configure the --cache parameter to
> > use it. 10 hours is not unusual.
> 
> I have imported last week's planet dump into a new set of tables on the
> main tile server. The increase in disc usage was 41GB and the import took
> about 2 hours longer than the non-slim import:
> 
> time ./osm2pgsql --slim -p slim -C 2000 /home/www/tile/direct/planet/planet-080709.osm.gz
> ...
> 9156.42user 96.11system 5:28:28elapsed 46%CPU (0avgtext+0avgdata 0maxresident)k
> 
> 
> > That said, testing can be done on a smaller scale. Pick two extract
> > from the same areas and use osmosis to make a diff. Load the first
> > file, then load the diff and compare it to the second file (somehow).
> 
> I'm starting to apply the daily diffs now. I'll let you know how I get
> on. 

I'm afraid the process to apply the diff has not gone well. I let the
osm2pgsql process run for 3 hours. During this time the query taking all
the time was node_changed_mark(). GDB and explain analyze both show that
this query takes on the order of 1 minute per node. In the whole 3 hours
osm2pgsql had only processed 238 lines of the .osc file!

gis=# explain analyze execute node_changed_mark(236612);
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on slim_ways  (cost=4168.05..71998.07 rows=18978 width=153) (actual time=36054.417..36054.417 rows=0 loops=1)
   Filter: ((nodes @ ARRAY[$1]) AND (NOT pending))
   ->  Bitmap Index Scan on slim_ways_nodes  (cost=0.00..4163.30 rows=20182 width=0) (actual time=6010.918..6010.918 rows=136147 loops=1)
         Index Cond: (nodes @ ARRAY[$1])
 Total runtime: 36054.548 ms

It looks like the GiST index is not coping well with this data. The
bitmap index scan above fetches 136147 rows, which looks way too high to
me. In general the number of ways referencing each node should be very
small.
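That expectation can be checked directly against the table from the plan above. A rough sketch (node id 236612 is just the one from the prepared-statement example; "@" is the intarray containment operator that appears in the filter):

```sql
-- If the data matches the expectation, this should return a small count
-- for a typical node, nowhere near the ~136k rows the index scan fetched.
SELECT count(*) FROM slim_ways WHERE nodes @ ARRAY[236612];
```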

I'm currently trying to build a GIN index to see if that does any
better.
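For reference, the GIN build on the same column might look like the following sketch (the index name is made up; gin__int_ops is the intarray operator class for GIN, available from contrib since 8.2):

```sql
-- Hypothetical: a GIN index alongside the existing GiST one, on the same
-- integer[] column, so the planner's behaviour can be compared directly.
CREATE INDEX slim_ways_nodes_gin ON slim_ways USING gin (nodes gin__int_ops);
```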

I suspect that we may need to add 'ways integer[]' into the nodes table
to efficiently mark the updated ways. What do you think?
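One possible shape for that change, as a hedged sketch only (the slim_nodes table name is an assumption based on the "-p slim" prefix, and node id 236612 is reused from the example above):

```sql
-- Hypothetical: keep a reverse node->ways mapping on the nodes table.
ALTER TABLE slim_nodes ADD COLUMN ways integer[];

-- Marking the ways touched by a changed node then becomes a primary-key
-- lookup plus an "= ANY(array)" match on way ids, instead of an
-- array-containment search across every way's node list. The cast forces
-- the scalar-subquery (array) form of ANY:
UPDATE slim_ways SET pending = true
 WHERE id = ANY ((SELECT ways FROM slim_nodes WHERE id = 236612)::integer[]);
```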

	Jon