[OSM-dev] Reducing osm2pgsql memory usage using a database method

Mon Mar 12 09:27:28 GMT 2007

Hi Frederik,

>
>> Is this using a true XML reader or a simple line matching approach?
>
> I never use "true XML readers" (would have to be a SAX parser here) for
> the planet file. Reasons for this:
>
> * the planet file is not true XML (quoting problems, UTF-8 problems),
> most libraries will complain

I don't understand your argument here. If there are problems with the
planet XML, let's fix them. Parsing XML with regexes won't get you far.
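
To illustrate what I mean by a true XML reader, here is a minimal sketch of a streaming (SAX) pass that counts elements without holding the document in memory. It is Python rather than the C used in osm2pgsql, and the handler name, the function name and the sample document are made up for illustration; they are not taken from osm2pgsql or from a real planet file.

```python
import io
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Counts node, segment and way elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.counts = {"node": 0, "segment": 0, "way": 0}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing is kept except the counters.
        if name in self.counts:
            self.counts[name] += 1


def count_elements(stream):
    """Parse a file-like object with SAX and return the element counts."""
    handler = CountingHandler()
    xml.sax.parse(stream, handler)
    return handler.counts


# Hypothetical sample data, only loosely modelled on the planet format.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.3">
  <node id="1" lat="51.5" lon="-0.1"/>
  <node id="2" lat="51.6" lon="-0.2"/>
  <segment id="10" from="1" to="2"/>
  <way id="100"><seg id="10"/></way>
</osm>
"""

if __name__ == "__main__":
    print(count_elements(io.BytesIO(SAMPLE)))
```

A parser like this raises an error on malformed input instead of silently skipping it, which is exactly how quoting and UTF-8 problems in the file would get noticed.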

>
> * a lot of efficiency can be gained by making the assumption that the
> file consists of nodes first, then segments, then ways, which, granted,
> is a "hack" as theoretically the XML could be in any sequence. If I drop
> my regular expressions because I say that the XML format could change
> anytime, then I'd also have to drop assumptions like this, and that
> would probably catapult me far beyond the 100 minute ballpark mentioned.

Ok, it is a hack after all.
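
If a reader does rely on that ordering, it should at least verify it. A rough sketch in Python of a line-matching pass that fails loudly when the nodes, segments, ways order is violated; the regular expression and the function name are my own assumptions, not code from any existing tool:

```python
import re

# Matches the opening tag of the three top-level element types.
ELEMENT_RE = re.compile(r"^\s*<(node|segment|way)\b")
# Assumed file order: all nodes, then all segments, then all ways.
ORDER = {"node": 0, "segment": 1, "way": 2}


def check_order(lines):
    """Count elements per type; raise ValueError if the order is broken."""
    counts = {"node": 0, "segment": 0, "way": 0}
    phase = 0  # index of the latest element type seen so far
    for lineno, line in enumerate(lines, 1):
        match = ELEMENT_RE.match(line)
        if not match:
            continue  # closing tags, child elements, header lines
        kind = match.group(1)
        if ORDER[kind] < phase:
            raise ValueError(
                f"line {lineno}: <{kind}> found after a later element type"
            )
        phase = ORDER[kind]
        counts[kind] += 1
    return counts
```

This only works if every element starts on its own line, which is one more assumption about the file layout that the XML format itself does not guarantee.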

>
> * regular expressions are faster (for this specific application and when
> doing it with Perl)

What application are you talking about? How does this application relate
to osm2pgsql? Can you prove it?

> If the only problem of that algorithm is memory consumption, could one
> not simply run it in multiple passes, dividing the globe up in a number
> of bounding boxes and working them one after the other, with a little
> bit of overlap to allow for long ways/segments? The size of the bounding
> boxes could be chosen heuristically based on the file size of the planet
> file and the amount of available memory, so someone with a 4 gig machine
> could still do the whole file in one pass, and if there's only 512 mb
> available it would also work, just slower? - If you always use a DB
> backend for transient then it'll always work but more memory will only
> give you an advantage if efficiently used for database caches.
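
For reference, the multi-pass idea quoted above could be sketched roughly as follows (Python, for illustration only). The heuristic, the `expansion` factor, the overlap value and both function names are assumptions of mine; nothing like this exists in osm2pgsql.

```python
import math


def passes_needed(planet_bytes, memory_bytes, expansion=1.0):
    """Estimate the number of passes from file size and available memory.

    `expansion` is a hypothetical ratio of in-memory size to on-disk size.
    """
    return max(1, math.ceil(planet_bytes * expansion / memory_bytes))


def longitude_strips(n, overlap=0.5):
    """Split the globe into n longitude strips, widened by `overlap` degrees.

    Returns (west, east) pairs clamped to the range [-180, 180].
    """
    width = 360.0 / n
    strips = []
    for i in range(n):
        west = -180.0 + i * width
        east = west + width
        strips.append((max(-180.0, west - overlap), min(180.0, east + overlap)))
    return strips


if __name__ == "__main__":
    # Example figures only: a 4 GiB input on a machine with 512 MiB free.
    n = passes_needed(4 * 2**30, 512 * 2**20)
    for west, east in longitude_strips(n):
        print(f"pass covering longitude {west} to {east}")
```

Note that a fixed overlap does not guarantee that a way longer than the overlap ends up complete in any single strip, so the passes would still need some way to resolve nodes that fall outside their box.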

> All wildly speculating since I'm way out of my depth here,

I understand.

> I was just
> taking exception at the original argument "let's ditch C in favour of
> C++ because there we have hash tables".

Hash tables are not the only reason to consider C++.
FYI, osm2pgsql already links to GEOS, which requires a C++ compiler.

I suggest you study the subject more carefully before writing long
emails.

Cheers,
Artem