[OSM-dev] Reducing osm2pgsql memory usage using a database method

Sat Mar 10 18:43:08 GMT 2007

On Sat, 2007-03-10 at 14:56 +0000, Artem Pavlenko wrote:
> Jon,
> 
> > I've just uploaded an experimental version of osm2pgsql which uses
> > PostgreSQL database tables for the transient node and segment storage.
> > This drops the memory usage from >1GB to ~60MB. On the downside, the
> > import time has gone up from 20 to 100 minutes. I'm sure this can be
> > improved though with some more database Mojo.
> >
> > For further details see SVN (utils/osm2pgsql/experimental/readme.txt)
> > or
> > http://trac.openstreetmap.org/browser/utils/osm2pgsql/experimental/readme.txt
> >
> Good stuff.
> 
> I'm working on a new osm.xml and I have some ideas on how to improve
> osm2pgsql output:
> 
> 1. We can rewrite osm2pgsql in C++ and take advantage of dynamic
> structures (e.g. std::map), safe formatting and casting
> (boost::lexical_cast, boost::format) and more.
> 

I think this makes sense. Handling items like the tag, segment and
attribute lists should be significantly simpler in C++.

> 2. At the moment there is a lot of redundant data in the output tables.
> Everything apart from geometries is dumped as 'TEXT'.
> 

> We can have a more flexible design where the table structure and
> attribute values are configurable (at compile time).

The current export tags table could easily be read in at run time. 
One possibility would be to move some of the rules from osm.xml into
osm2pgsql, e.g. the roads, leisure, water, text could become separate
tables instead of requiring select statements in the osm.xml file. I
guess osm2pgsql could even be taught how to interpret the osm.xml file.
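
As a rough sketch of the run-time idea, assuming a hypothetical plain-text
file with one "key type" pair per line (neither this format nor the names
ExportTag and read_export_tags exist in osm2pgsql; they are made up for
illustration), the loader could be as small as:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical: one exported tag and the column type it is stored as.
struct ExportTag {
    std::string key;   // e.g. "highway"
    std::string type;  // e.g. "text" or "int4"
};

// Parse "key type" pairs, one per line. Blank lines and lines whose first
// token starts with '#' are skipped; a missing type defaults to "text".
std::vector<ExportTag> read_export_tags(std::istream& in)
{
    std::vector<ExportTag> tags;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        ExportTag tag;
        if (!(fields >> tag.key) || tag.key[0] == '#')
            continue;
        if (!(fields >> tag.type))
            tag.type = "text";
        tags.push_back(tag);
    }
    return tags;
}
```

Passing a std::ifstream opened on the tag file would replace the table that
is currently compiled in, so adding a column would not need a rebuild.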

> Consider this for example:
> To render highway features in the correct order I want to have a z_order
> field in planet_osm_table calculated as follows:

> int z_order(osm_feature const& feat)
> {
>     int layer = 0; // default
>     try
>     {
>         layer = boost::lexical_cast<int>(feat["layer"]);
>     }
>     catch (boost::bad_lexical_cast const&)
>     {
>         // layer tag has got lots of junk!!!
>     }
>     int highway_z = 0; // 0..9
>     std::string highway = feat["highway"];
>     if (highway == "motorway" || highway == "motorway_link")
>     {
>         highway_z = 9;
>     }
>     else if (...) {}
>     ....
> 
>     bool bridge = false;
>     try
>     {
>         // lexical_cast<bool> rejects values such as "yes" or "true"
>         bridge = boost::lexical_cast<bool>(feat["bridge"]);
>     }
>     catch (...) {}
>     // parenthesise the conditional: ?: binds more loosely than +
>     return 10 * (layer + (bridge ? 1 : 0)) + highway_z;
> }
> 
> Also I want to have consistent numeric feature_type calculated  
> differently depending on tags/values. This will make rendering more  
> efficient and will bring some (needed) sanity to styles in osm.xml.
> 
> 3. Also we can abstract 'output writing' to have multiple back-ends:
> MySQL, SQLite, shapefiles, etc.
> 
> What do you think?
> 

Makes sense to me. I would consider having an abstraction layer for both
the data output and also the transient storage (array in RAM, database
tables, mmapped file).
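
To make the transient storage half concrete, here is a minimal sketch of
what such a layer might look like. The names Node, NodeStore and
RamNodeStore are invented for this example and are not taken from the
osm2pgsql source; only the in-RAM backend is shown.

```cpp
#include <map>

// Hypothetical node record: just the coordinates needed to build geometries.
struct Node {
    double lat;
    double lon;
};

// Interface the importer would code against; it never sees where nodes live.
class NodeStore {
public:
    virtual ~NodeStore() {}
    virtual void put(long id, const Node& node) = 0;
    // Returns true and fills 'out' if the node is known.
    virtual bool get(long id, Node& out) const = 0;
};

// In-RAM backend. A database-table or mmapped-file backend would implement
// the same two methods.
class RamNodeStore : public NodeStore {
public:
    void put(long id, const Node& node) { nodes_[id] = node; }
    bool get(long id, Node& out) const {
        std::map<long, Node>::const_iterator it = nodes_.find(id);
        if (it == nodes_.end())
            return false;
        out = it->second;
        return true;
    }
private:
    std::map<long, Node> nodes_;
};
```

The import loop would only hold a NodeStore reference, so the choice
between memory use and speed becomes a matter of which backend is
constructed at start-up.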

Maybe it should adopt a plugin architecture with a config file much like
mapnik. This might allow multiple simultaneous outputs, e.g. roads to
DB, coast to shapefile. 
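
One possible shape for the multiple-output idea, purely as an illustration:
Output, RecordingOutput and Router are hypothetical names, features are
reduced to plain strings, and the routes are added by hand where a real
tool would read them from the config file.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical output backend: a PostgreSQL or shapefile writer would
// implement write().
class Output {
public:
    virtual ~Output() {}
    virtual void write(const std::string& feature) = 0;
};

// Stand-in backend that just remembers what it was given.
class RecordingOutput : public Output {
public:
    void write(const std::string& feature) { written.push_back(feature); }
    std::vector<std::string> written;
};

// Routes each feature class ("roads", "coast", ...) to zero or more outputs.
class Router {
public:
    void add_route(const std::string& feature_class, Output* out) {
        routes_[feature_class].push_back(out);
    }
    // Returns the number of outputs the feature was sent to.
    std::size_t dispatch(const std::string& feature_class,
                         const std::string& feature) const {
        std::map<std::string, std::vector<Output*> >::const_iterator it =
            routes_.find(feature_class);
        if (it == routes_.end())
            return 0;
        for (std::size_t i = 0; i < it->second.size(); ++i)
            it->second[i]->write(feature);
        return it->second.size();
    }
private:
    std::map<std::string, std::vector<Output*> > routes_;
};
```

Registering one backend under "roads" and another under "coast" gives the
split described above, and a class with no route is simply dropped.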

> Artem.
> 
> PS. I'm running Mapnik, PostgreSQL and osm2pgsql on Mac OS X now and I
> get substantial performance improvements, mainly from faster disk I/O.
> I wonder if we should look into finding an optimal filesystem for
> PostgreSQL. I'm using ext3 on Linux.
> 

Are you sure you are comparing like-for-like systems? I run on
Linux/ext3 and do not notice any significant IO issues. Mind you, my
machine has 2GB of RAM so a lot is cached and I have 4 disks in a RAID5
setup so the IO rate is quite good.

Anecdotally, I have heard that OS X tends to be slightly slower than
Linux in most tests (I had better run for cover now...).

	Jon