[OSM-talk] Planet Dump

Sat Mar 10 12:44:00 GMT 2007

On Fri, 2007-03-09 at 10:59 +0000, Artem Pavlenko wrote:
> On 9 Mar 2007, at 10:40, Keith Sharp wrote:
> 
> > On Fri, 2007-03-09 at 09:27 +0000, Artem Pavlenko wrote:
> >>>
> >>> I was about to try setting up PostGIS + Planet + Mapnik on my home
> >>> system, how much memory does osm2pgsql need?
> >>
> >> I run osm2pgsql on my laptop with 512Mb of RAM.  It takes around
> >> 20-40 min to convert the whole planet.
> >> Don't forget to pipe planet.osm  through UTF8Sanitize.
> >
> > What impact do error reports from UTF8Sanitize have on the output XML
> > file?  With the latest planet.osm I am getting:
> >
> > $ cat planet-070307.osm | ./svn/utils/planet.osm/C/UTF8sanitize >
> > planet.osm
> > Error at line 72505269
> > Error at line 72583739
> > Summary:
> > chars1: 3301140876
> > chars2: 139176
> > chars3: 1589
> > chars4: 0
> > chars5: 0
> > chars6: 1
> > lines : 73468080
> >
> > Is it safe to continue or do I need to investigate these errors  
> > further?
> 
> I get the same results which is good. I'm not sure if UTF8Sanitizer  
> actually replaces invalid UTF-8 with something but it is safe to  
> continue.
> 

UTF8Sanitizer overwrites any invalid UTF-8 character sequences so that
the output should be valid UTF-8. Generally it is the name tags which
contain the invalid characters, so the structure of the file is
unaffected.
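
To illustrate the idea, here is a minimal sketch of that kind of
sanitizing pass. It is not the UTF8sanitize source: the function names
and the '_' replacement byte are assumptions made for the example, it
only checks lead and continuation bytes (not overlong encodings), and it
accepts sequences of at most 4 bytes, whereas the real tool also counts
5 and 6 byte forms.

```c
#include <stddef.h>
#include <string.h>

/* Length of the sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;               /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;                             /* stray continuation or bad lead */
}

/* Copy in[0..len) to out, replacing every byte that is not part of a
 * structurally valid sequence with '_' (hypothetical choice).
 * out must have room for len bytes. Returns the bytes written. */
size_t sanitize_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < len) {
        size_t n = utf8_seq_len(in[i]);
        int ok = (n != 0 && i + n <= len);

        /* Every byte after the lead must look like 10xxxxxx. */
        for (size_t k = 1; ok && k < n; k++)
            ok = (in[i + k] & 0xC0) == 0x80;

        if (ok) {
            memcpy(out + o, in + i, n);   /* keep the valid sequence */
            o += n;
            i += n;
        } else {
            out[o++] = '_';               /* overwrite one bad byte */
            i++;
        }
    }
    return o;
}
```

Because only the offending bytes are touched, the surrounding XML markup
passes through unchanged, which is why the file structure survives.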

> BTW, running latest osm2pgsql with latest planet-070307.osm  fails   
> at osm2pgsql.c:404 (assert)
> I just commented out that line and it's still running.  I don't think  
> that test is correct anyway, Jonb?
> 

That should be safe. The assert fails when either the latitude or the
longitude of a node is 0. I think the only risk is that such a node gets
ignored later on, as 0,0 is used to represent a node which does not
exist. Strictly the assert should probably be "lat || lon". I littered
the code with lots of asserts, both to catch bad data and to check
whether the assumptions I was making about the data were correct.
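
A small sketch of the difference between the two conditions. The
function names are made up for the example, and the "either coordinate
is 0" form is only my reading of why the check fails, not a copy of
osm2pgsql.c:404.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed shape of the failing check: rejects a node as soon as
 * either coordinate is 0, e.g. a node on the equator. */
bool both_nonzero(double lat, double lon)
{
    return lat != 0.0 && lon != 0.0;
}

/* Relaxed form, "lat || lon": rejects only the 0,0 pair that is
 * used to represent a node which does not exist. */
bool node_exists(double lat, double lon)
{
    return lat != 0.0 || lon != 0.0;
}

/* Hypothetical call site using the relaxed condition. */
void check_node(double lat, double lon)
{
    assert(node_exists(lat, lon));
}
```

With the relaxed form a node at lat 51.5, lon 0 passes, and only a node
at exactly 0,0 still trips the assert.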

I'm currently working on a rewrite to use a DB for intermediate storage.
I'm also cleaning up some of the code and algorithms along the way. More
on this in another email shortly.

	Jon