[OSM-dev] Reducing osm2pgsql memory usage using a database method

Sat Mar 10 20:04:00 GMT 2007

On 10 Mar 2007, at 18:43, Jon Burgess wrote:

> On Sat, 2007-03-10 at 14:56 +0000, Artem Pavlenko wrote:
>> Jon,
>>
>>> I've just uploaded an experimental version of osm2pgsql which uses
>>> PostgreSQL database tables for the transient node and segment storage.
>>> This drops the memory usage from >1GB to ~60MB. On the downside, the
>>> import time has gone up from 20 to 100 minutes. I'm sure this can be
>>> improved though with some more database Mojo.
>>>
>>> For further details see SVN (utils/osm2pgsql/experimental/readme.txt)
>>> or
>>> http://trac.openstreetmap.org/browser/utils/osm2pgsql/experimental/readme.txt
>>>
>> Good stuff.
>>
>> I'm working on new osm.xml and I have some ideas on how to improve
>> osm2pgsql output:
>>
>> 1. We can re-write osm2pgsql in C++ and take advantage of dynamic
>> structures, e.g. std::map, and safe formatting and casting with
>> boost::lexical_cast, boost::format and more.
>>
>
> I think this makes sense. Handling items like the tag, segment and
> attribute lists should be significantly simpler in c++.
>
>> 2. At the moment there is a lot of redundant data in the output tables.
>> Everything apart from geometries is dumped as 'TEXT'.
>>
>
>> We can have a more flexible design where table structure, attribute
>> values are configurable (at compile time).
>
> The current export tags table could easily be read in at run time.
> One possibility would be to move some of the rules from osm.xml into
> osm2pgsql, e.g. the roads, leisure, water, text could become separate
> tables instead of requiring select statements in the osm.xml file. I
> guess osm2pgsql could even be taught how to interpret the osm.xml file.
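
As a rough illustration of reading the export tag list at run time rather
than compiling it in, here is a minimal C++ sketch. The file format (one
"<name> <type>" pair per line, '#' for comments), the export_tag struct and
the load_export_tags function are all assumptions made for this example;
none of them come from the osm2pgsql source.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one exported column.
struct export_tag
{
    std::string name; // OSM tag key, e.g. "highway"
    std::string type; // column type, e.g. "text" or "int4"
};

// Parse one tag per line: "<name> <type>". Blank lines and lines whose
// first field starts with '#' are skipped. The format is an assumption.
std::vector<export_tag> load_export_tags(std::istream& in)
{
    std::vector<export_tag> tags;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        export_tag tag;
        if (!(fields >> tag.name) || tag.name[0] == '#')
            continue; // blank line or comment
        if (!(fields >> tag.type))
            tag.type = "text"; // fall back to the current all-TEXT behaviour
        tags.push_back(tag);
    }
    return tags;
}
```

The resulting list could then drive table creation, so adding a column
would mean editing a text file instead of rebuilding the tool.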

>> Consider this for example:
>> To render highway features in the correct order I want to have a z_order
>> field in planet_osm_table calculated as follows:
>
>> int z_order(osm_feature const& feat)
>> {
>>     int layer = 0; // default
>>     try
>>     {
>>         layer = boost::lexical_cast<int>(feat["layer"]);
>>     }
>>     catch (boost::bad_lexical_cast&)
>>     {
>>         // layer tag has got lots of junk!!!
>>     }
>>     int highway_z = 0; // 0..9
>>     std::string highway = feat["highway"];
>>     if (highway == "motorway" || highway == "motorway_link")
>>     {
>>         highway_z = 9;
>>     }
>>     else if (...) {}
>>     ....
>>
>>     bool bridge = false;
>>     try
>>     {
>>         // values such as "yes" or "true" throw and leave bridge false
>>         bridge = boost::lexical_cast<bool>(feat["bridge"]);
>>     }
>>     catch (...) {}
>>     // parenthesised because ?: binds more loosely than +
>>     return 10 * (layer + (bridge ? 1 : 0)) + highway_z;
>> }
>>
>> Also I want to have consistent numeric feature_type calculated
>> differently depending on tags/values. This will make rendering more
>> efficient and will bring some (needed) sanity to styles in osm.xml.
>>
>> 3. Also we can abstract 'output writing' to have multiple back-ends :
>> mysql, sqlite , shapefiles etc .
>>
>> What do you think?
>>
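
The consistent numeric feature_type mentioned under point 2 could be
sketched as an ordered rule table, where the first matching rule wins. This
is only an illustration: the numeric codes, the tag_map typedef and the
feature_type_of function are hypothetical, since the thread does not define
any of them.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> tag_map;

// One rule: if tags[key] == value, the feature gets this code.
struct feature_rule
{
    const char* key;
    const char* value;
    int code;
};

// Hypothetical codes; earlier rules take priority over later ones.
static const feature_rule feature_rules[] = {
    {"highway", "motorway", 100},
    {"highway", "motorway_link", 100},
    {"highway", "primary", 110},
    {"railway", "rail", 200},
    {"natural", "water", 300},
};

// Returns 0 when no rule matches (unknown or junk tags).
int feature_type_of(tag_map const& tags)
{
    for (feature_rule const& rule : feature_rules)
    {
        tag_map::const_iterator it = tags.find(rule.key);
        if (it != tags.end() && it->second == rule.value)
            return rule.code;
    }
    return 0;
}
```

With something like this, a style rule could filter on an integer range
instead of comparing several text columns.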
>
> Makes sense to me. I would consider having an abstraction layer for both
> the data output and also the transient storage (array in RAM, database
> tables, mmapped file).

Yes, even better.
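
A minimal sketch of what an abstraction over the transient storage might
look like, assuming a simple put/get interface keyed by node id. The names
node, node_store and ram_node_store are made up for this example; a
database-table or mmapped-file variant would implement the same interface.

```cpp
#include <map>

// Hypothetical transient node record.
struct node
{
    double lat;
    double lon;
};

// Interface that every transient store would implement.
class node_store
{
public:
    virtual ~node_store() {}
    virtual void put(long id, node const& n) = 0;
    // Returns false if the node id is unknown; out is left untouched.
    virtual bool get(long id, node& out) const = 0;
};

// In-RAM implementation: fast, but memory grows with the input size.
class ram_node_store : public node_store
{
public:
    void put(long id, node const& n) override { nodes_[id] = n; }

    bool get(long id, node& out) const override
    {
        std::map<long, node>::const_iterator it = nodes_.find(id);
        if (it == nodes_.end())
            return false;
        out = it->second;
        return true;
    }

private:
    std::map<long, node> nodes_;
};
```

The import code would only ever see node_store, so swapping RAM for
database tables becomes a choice made at start-up rather than a separate
build of the tool.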

> Maybe it should adopt a plugin architecture with a config file, much like
> Mapnik. This might allow multiple simultaneous outputs, e.g. roads to
> DB, coast to shapefile.

Exactly.
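
To make the "roads to DB, coast to shapefile" idea concrete, here is a small
sketch of routing features to several back-ends at once. Everything in it is
an assumption for illustration: the feature struct, the output_backend
interface, the output_router class, and the recording_backend stand-in used
in place of real PostgreSQL or shapefile writers.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical feature: a layer name plus its geometry as WKT.
struct feature
{
    std::string layer;
    std::string wkt;
};

// Interface a back-end plugin (database, shapefile, ...) would implement.
class output_backend
{
public:
    virtual ~output_backend() {}
    virtual void write(feature const& f) = 0;
};

// Stand-in back-end that only records what it was asked to write.
class recording_backend : public output_backend
{
public:
    void write(feature const& f) override { written.push_back(f.wkt); }
    std::vector<std::string> written;
};

// Maps layer names to one or more back-ends, as a config file might.
class output_router
{
public:
    void route(std::string const& layer, std::shared_ptr<output_backend> out)
    {
        routes_[layer].push_back(out);
    }

    // Sends the feature to every back-end registered for its layer and
    // returns how many back-ends received it.
    std::size_t dispatch(feature const& f)
    {
        auto it = routes_.find(f.layer);
        if (it == routes_.end())
            return 0;
        for (auto const& out : it->second)
            out->write(f);
        return it->second.size();
    }

private:
    std::map<std::string, std::vector<std::shared_ptr<output_backend>>> routes_;
};
```

In a real plugin setup the route() calls would presumably be generated from
the config file, and features with no matching route would simply be
dropped.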

>>
>
> Are you sure you are comparing like-for-like systems? I run on
> Linux/ext3 and do not notice any significant IO issues. Mind you, my
> machine has 2GB of RAM so a lot is cached and I have 4 disks in a RAID5
> setup so the IO rate is quite good.
>
You're right. The Linux box is a 512MB AMD x86_64 and the Mac is an Intel
Core 2 Duo (running in 32-bit mode).
> Anecdotally, I have heard that OS-X tends to be slightly slower than
> Linux in most tests (I had better run for cover now...).

I've heard that DOS is the fastest filesystem for the PostgreSQL data
dir :)

I'm running Fedora Core 6 (64-bit) on the same MacBook, so I'll try to
do a better comparison soon.

Cheers,
Artem