[OSM-talk] Planet Dump

Jon Burgess jburgess777 at googlemail.com
Sat Mar 10 12:44:00 GMT 2007


On Fri, 2007-03-09 at 10:59 +0000, Artem Pavlenko wrote:
> On 9 Mar 2007, at 10:40, Keith Sharp wrote:
> 
> > On Fri, 2007-03-09 at 09:27 +0000, Artem Pavlenko wrote:
> >>>
> >>> I was about to try setting up PostGIS + Planet + Mapnik on my home
> >>> system, how much memory does osm2pgsql need?
> >>
> >> I run osm2pgsql on my laptop with 512Mb of RAM.  It takes around
> >> 20-40 min to convert the whole planet.
> >> Don't forget to pipe planet.osm  through UTF8Sanitize.
> >
> > What impact do error reports from UTF8Sanitize have on the output XML
> > file?  With the latest planet.osm I am getting:
> >
> > $ cat planet-070307.osm | ./svn/utils/planet.osm/C/UTF8sanitize >
> > planet.osm
> > Error at line 72505269
> > Error at line 72583739
> > Summary:
> > chars1: 3301140876
> > chars2: 139176
> > chars3: 1589
> > chars4: 0
> > chars5: 0
> > chars6: 1
> > lines : 73468080
> >
> > Is it safe to continue or do I need to investigate these errors  
> > further?
> 
> I get the same results which is good. I'm not sure if UTF8Sanitizer  
> actually replaces invalid UTF-8 with something but it is safe to  
> continue.
> 

UTF8sanitize overwrites any invalid UTF-8 character sequences so that
the output should be valid UTF-8. Generally it is name tags which
contain the invalid characters, so the structure of the file is
unaffected.
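
To illustrate the idea of overwriting invalid sequences, here is a
minimal sketch in C. It is not the actual utility's code: the function
names, the choice of '?' as the replacement byte, and the strict 1 to 4
byte validation rules are all assumptions made for this example. Each
offending byte is replaced one-for-one, so the output has the same
length as the input and the surrounding XML markup is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length (1-4) of a well-formed UTF-8
 * sequence starting at s, or 0 if the bytes at s are not valid UTF-8. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t n, i;
    unsigned char c = s[0];

    if (c < 0x80)
        return 1;                       /* plain ASCII */
    else if (c >= 0xC2 && c <= 0xDF)
        n = 2;
    else if (c >= 0xE0 && c <= 0xEF)
        n = 3;
    else if (c >= 0xF0 && c <= 0xF4)
        n = 4;
    else
        return 0;   /* stray continuation byte, overlong lead, or > U+10FFFF */

    if (avail < n)
        return 0;                       /* sequence truncated at end of input */

    for (i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* expected a continuation byte */

    if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (c == 0xED && s[1] >= 0xA0) return 0;  /* UTF-16 surrogate range */
    if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (c == 0xF4 && s[1] >= 0x90) return 0;  /* beyond U+10FFFF */

    return n;
}

/* Hypothetical sanitizer: copies len bytes from in to out, overwriting
 * every byte that is not part of a valid sequence with '?' (an assumed
 * replacement). out must have room for len bytes. Returns the number of
 * bytes that were replaced. */
size_t sanitize_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, replaced = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        size_t n = utf8_seq_len(in + i, len - i);
        if (n == 0) {
            out[i] = '?';
            i++;
            replaced++;
        } else {
            size_t k;
            for (k = 0; k < n; k++)
                out[i + k] = in[i + k];
            i += n;
        }
    }
    return replaced;
}
```

Because a bad byte is replaced rather than dropped, a damaged name tag
value just gains a few '?' characters and the element around it still
parses.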


> BTW, running latest osm2pgsql with latest planet-070307.osm  fails   
> at osm2pgsql.c:404 (assert)
> I just commented out that line and it's still running.  I don't think  
> that test is correct anyway, Jonb?
> 


That should be safe. It fails because either the latitude or longitude
of a node is 0. I think the only risk is that this node gets ignored
later on, as 0,0 is used to represent a node which does not exist.
Strictly, the assert should probably be "lat || lon". I littered the code
with lots of asserts, both to catch bad data and to check whether the
assumptions I was making about the data were correct.
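
As a rough sketch of the difference (the function names here are made
up for illustration and are not taken from osm2pgsql.c, and the strict
form is only my guess at the shape of a test that trips when a single
coordinate is 0):

```c
#include <assert.h>

/* Hypothetical strict check: true only when BOTH coordinates are
 * non-zero, so a valid node on the equator or the prime meridian
 * would fail it. */
int coords_strict(double lat, double lon)
{
    return lat && lon;
}

/* Hypothetical relaxed check, matching "lat || lon": false only for
 * 0,0, which is the value used to mean "node does not exist". */
int coords_relaxed(double lat, double lon)
{
    return lat || lon;
}

/* Example use inside node handling code (illustrative only). */
void check_node(double lat, double lon)
{
    assert(coords_relaxed(lat, lon));
}
```

With the relaxed form, a node at lat 0, lon 51.5 passes, while a node
at exactly 0,0 still trips the assert.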

I'm currently working on a rewrite to use a DB for intermediate storage.
I'm also cleaning up some of the code and algorithms along the way. More
on this in another email shortly.

	Jon





