[OSM-dev] Reducing osm2pgsql memory usage using a database method

Sun Mar 11 21:54:07 GMT 2007

Hi,

> Is this using a true XML reader or a simple line matching approach? 

I never use "true XML readers" (would have to be a SAX parser here) for 
the planet file. Reasons for this:

* the planet file is not true XML (quoting problems, UTF-8 problems), 
most libraries will complain

* a lot of efficiency can be gained by making the assumption that the 
file consists of nodes first, then segments, then ways, which, granted, 
is a "hack" as theoretically the XML could be in any sequence. If I drop 
my regular expressions because I say that the XML format could change 
anytime, then I'd also have to drop assumptions like this, and that 
would probably catapult me far beyond the 100 minute ballpark mentioned.

* regular expressions are faster (for this specific application and when 
doing it with Perl)

> Many of the Perl scripts to parse OSM data use line based string
> matching which always seems like a hack to me since there are no real
> guarantees about how the XML lines will be formatted.

As always - it depends. Carefully crafted regular expressions can go a 
long way (can carelessly used SAX parsers can fall on their noses before 
you know it).

> My experience in trying to optimise the memory use of the simplify.pl
> script was that Perl did require something of the order of 100 bytes per
> node using a hash (this is on a 64 bit machine which makes things worse
> too).

I thought it must be less but a quick experiment proves you right.

> I agree that it is possible to write bad code in any language, but some
> algorithms require holding lots of data within quick reach to operate
> effectively, not everything can be written to use a streaming model.

Definitely true.

> It was this experience which made me choose C to implement the current
> osm2pgsql code (since the algorithm that it was based on required the
> transient storage of all nodes and segments). 

If the only problem of that algorithm is memory consumption, could one 
not simply run it in multiple passes, dividing the globe up in a number 
of bounding boxes and working them one after the other, with a little 
bit of overlap to allow for long ways/segments? The size of the bounding 
boxes could be chosen heuristically based on the file size of the planet 
file and the amount of available memory, so someone with a 4 gig machine 
could still do the whole file in one pass, and if there's only 512 mb 
available it would also work, just slower? - If you always use a DB 
backend for transient then it'll always work but more memory will only 
give you an advantage if efficiently used for database caches.

All wildly speculating since I'm way out of my depth here, I was just 
taking exception at the original argument "let's ditch C in favour of 
C++ because there we have hash tables".

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik at remote.org  ##  N49°00.09' E008°23.33'