[OSM-dev] Reducing osm2pgsql memory usage using a database method
jburgess777 at googlemail.com
Sun Mar 11 17:25:32 GMT 2007
On Sun, 2007-03-11 at 17:37 +0100, Frederik Ramm wrote:
> > There are several Perl interfaces to OSM, but working with planet.osm in Perl
> > - takes 40 minutes just for reading
> > - takes 100 bytes of memory for each node stored in memory,
> > which is more than 10 times what is really needed.
> That's not accurate. My script for determining the last-modified time of
> level-12 tiles processes the full planet file - actually more of it,
> because it first seeks to the beginning of the way portion, then re-starts
> and seeks to the segment portion, then reads nodes, all to save memory -
> in less than 15 minutes.
Is this using a true XML reader or a simple line matching approach?
Many of the Perl scripts to parse OSM data use line based string
matching which always seems like a hack to me since there are no real
guarantees about how the XML lines will be formatted.
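The fragility is easy to demonstrate: nothing in the XML spec fixes attribute order, whitespace, or one-element-per-line layout, so a regex tuned to one planet dump can silently miss data in the next. A streaming parser avoids this without loading the whole document into memory. A minimal sketch in Python rather than Perl, purely for illustration:

```python
# Stream-parse OSM-style XML without assuming any particular line
# layout. A line-matching regex tuned to the first node's formatting
# would miss the second node, which is equally valid XML.
import io
import xml.etree.ElementTree as ET

osm = io.BytesIO(b"""<?xml version="1.0"?>
<osm version="0.4">
  <node id="1" lat="51.5" lon="-0.1"/>
  <node lon="-0.2"
        lat="51.6" id="2"></node>
</osm>""")

nodes = {}
for event, elem in ET.iterparse(osm, events=("end",)):
    if elem.tag == "node":
        nodes[int(elem.get("id"))] = (float(elem.get("lat")),
                                      float(elem.get("lon")))
        elem.clear()  # discard parsed element data as we stream

print(nodes)  # both nodes found regardless of formatting
```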
> And storing a node with lat/lon as float and its id will need at least
> 12 bytes in any programming language I know, PLUS the overhead incurred
> by any hash map structure you decide to use... granted, you'll not be
> able to live on 12 bytes for a Node in Perl, but you cannot hold Perl
> responsible for the thoughtless way in which people write their scripts.
My experience in trying to optimise the memory use of the simplify.pl
script was that Perl did require something on the order of 100 bytes per
node using a hash (this is on a 64-bit machine, which makes things worse).
In summary, Perl is great for holding thousands of large objects. When
you get to manipulating millions of tiny objects, the overhead
introduced by Perl is huge (you had better have hundreds of MB of
memory available). The only way to work around this seems to be to
implement some custom data storage in C (e.g. see Bit::Vector) or a
backing store (e.g. DBI or a custom binary file format).
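The same overhead shows up in any high-level language that boxes every value in a general-purpose container. An illustrative sketch in Python (not Perl, but the effect is analogous) comparing a per-node hash entry against flat packed arrays:

```python
# Rough per-node cost: a dict mapping id -> (lat, lon) tuple versus
# two flat arrays of packed 8-byte doubles. Exact figures vary by
# interpreter and platform; the point is the order-of-magnitude gap.
import sys
from array import array

n = 100_000
as_dict = {i: (float(i), float(i)) for i in range(n)}
lats = array("d", (float(i) for i in range(n)))
lons = array("d", (float(i) for i in range(n)))

dict_bytes = (sys.getsizeof(as_dict)
              + sum(sys.getsizeof(v) for v in as_dict.values()))
array_bytes = sys.getsizeof(lats) + sys.getsizeof(lons)

print(dict_bytes // n, "bytes/node via dict of tuples")
print(array_bytes // n, "bytes/node via packed arrays")  # ~16
```

The dict figure here even understates the cost, since it ignores the boxed float and integer-key objects themselves.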
I agree that it is possible to write bad code in any language, but some
algorithms require holding lots of data within quick reach to operate
effectively; not everything can be written to use a streaming model.
It was this experience which made me choose C to implement the current
osm2pgsql code (since the algorithm that it was based on required the
transient storage of all nodes and segments).
I'm experimenting with using a DB backend for the transient storage and
I'm revisiting whether a partially streamed algorithm can be developed
using 2 phases: First stream all nodes, segments & ways into DB tables,
then use queries to convert the data into the linestring geometries.
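The shape of that two-phase approach can be sketched with SQLite standing in for PostgreSQL (an assumption made purely to keep the example self-contained; osm2pgsql itself targets Postgres). The table and column names are hypothetical:

```python
# Phase 1: stream nodes and way membership into transient tables.
# Phase 2: a join reassembles each way's node coordinates in order,
# producing a WKT LINESTRING. (In 2007-era OSM, ways referenced
# segments rather than nodes directly; this sketch simplifies to
# node references.)
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL);
    CREATE TABLE way_nodes (way_id INTEGER, seq INTEGER, node_id INTEGER);
""")

# Phase 1: pretend these rows arrived from a streaming XML pass.
db.executemany("INSERT INTO nodes VALUES (?,?,?)",
               [(1, 51.50, -0.10), (2, 51.51, -0.11), (3, 51.52, -0.12)])
db.executemany("INSERT INTO way_nodes VALUES (?,?,?)",
               [(10, 0, 1), (10, 1, 2), (10, 2, 3)])

# Phase 2: build the linestring geometry per way with a query.
def linestring(way_id):
    pts = db.execute("""
        SELECT n.lon, n.lat FROM way_nodes w
        JOIN nodes n ON n.id = w.node_id
        WHERE w.way_id = ? ORDER BY w.seq""", (way_id,)).fetchall()
    return "LINESTRING(" + ",".join(f"{lon} {lat}" for lon, lat in pts) + ")"

print(linestring(10))
# → LINESTRING(-0.1 51.5,-0.11 51.51,-0.12 51.52)
```

In PostGIS the second phase would more likely use GeomFromText or an aggregate over the joined rows, but the structure, bulk-load then query, is the same.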