[OSM-dev] [OSM-talk] Osmosis error with multiple bounding boxes

Lambertus osm at na1400.info
Fri Oct 26 10:59:33 BST 2007

Brett Henderson wrote:
> Lambertus wrote:
>> The current method of tee is indeed not truly scalable ;-) With 100 
>> pipes and -Xmx2048m the processing is still in the clear, but with 200 
>> Osmosis quits with 'out of heap memory' after processing the nodes (no 
>> ways exported yet).
> That all depends on your definition of scalable, I'm impressed that it 
> made it that far :-)  How many bounding boxes do you need to create?  If 
> it's only in the 100s then invoking osmosis several times is likely to 
> give you the best performance.  Otherwise we can look into persistence 
> alternatives.

I wrote that message with a winking smiley. I think 100 output pipes is 
already pretty impressive. Currently I need about 200, but that's only 
for a very small country (in area). There are likely plenty of 
situations where 1000+ are needed. But even then it's still doable with 
the current setup.
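Brett's suggestion of invoking Osmosis several times could be combined with --tee by batching the boxes across runs. A rough sketch (file names and coordinates are made up, and the exact task syntax may vary between Osmosis versions):

```shell
# Each run fans out to a small batch of bounding boxes via --tee,
# instead of one run with 200 pipes. Coordinates and file names are
# illustrative only; check the task syntax for your Osmosis version.
osmosis --read-xml file="planet.osm" \
        --tee outputCount=2 \
        --bounding-box left="4.0" bottom="51.0" right="6.0" top="52.0" \
          --write-xml file="box-001.osm" \
        --bounding-box left="6.0" bottom="51.0" right="8.0" top="52.0" \
          --write-xml file="box-002.osm"
```

Each run re-parses the planet file, so this trades parsing time for a bounded heap per run.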
> Having thought about it a bit more, my previous suggestion of persisting 
> bit sets to disk may not be a great idea, the files are going to be very 
> large and with random access patterns are going to cause your disk to 
> spend all of its time seeking because we already know they're not going 
> to be fully cached in RAM.  It could be worth a try because it should be 
> relatively simple to implement with the IndexStore class, I'm just not 
> optimistic about its performance.  A smarter index would be more 
> appropriate that caters to the fact that the selected ids within a 
> bounding box are going to be very sparse, in other words the number of 
> ids within the box are small compared to those outside it.
> The store and forward approach isn't likely to help much either, the 
> java XML parser is likely to be faster than reading from temp files.  
> I'm sure there are ways of speeding up the temp files further but it's 
> not trivial.
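The sparse-index idea above could be sketched roughly as follows: since only a small fraction of all ids fall inside a box, store just the selected ids instead of a dense bit set over the whole id range. The names are illustrative, not the actual Osmosis API:

```python
# Sketch of a sparse id index for one bounding box. Memory is
# proportional to the number of matching ids, not to the full id
# range, which suits the "ids inside the box are sparse" observation.
class SparseIdTracker:
    def __init__(self):
        self._ids = set()           # only ids actually inside the box

    def mark(self, node_id: int) -> None:
        self._ids.add(node_id)

    def __contains__(self, node_id: int) -> bool:
        return node_id in self._ids

tracker = SparseIdTracker()
for nid in (17, 42, 99):            # ids that fell inside the box
    tracker.mark(nid)
print(42 in tracker, 7 in tracker)  # membership checks for way filtering
```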
Without any knowledge of the inner workings of Osmosis: would a 
'write-through' implementation be possible (like the one the 
planetosm-excerpt-area.pl script uses)? The data is sent to all 
'listeners' (inPipes); each inPipe checks whether it is interested in 
the data and, if so, writes it to its outFile (or pipe). That would 
eliminate all buffering and excessive memory usage.
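A minimal sketch of that write-through idea, assuming a simple (lon, lat) node stream and a flat bbox check; the listener class and output format are illustrative, not anything from Osmosis or the Perl script:

```python
import io

# Each parsed entity is handed to every listener exactly once; a listener
# writes it straight to its own output if it falls inside the listener's
# bounding box. No per-pipe buffering is kept.
class BBoxListener:
    def __init__(self, left, bottom, right, top, out):
        self.box = (left, bottom, right, top)
        self.out = out                      # any file-like object

    def handle_node(self, node_id, lon, lat):
        left, bottom, right, top = self.box
        if left <= lon <= right and bottom <= lat <= top:
            self.out.write(f"node {node_id} {lon} {lat}\n")

listeners = [
    BBoxListener(4.0, 51.0, 6.0, 52.0, io.StringIO()),   # box A
    BBoxListener(-1.0, 50.0, 1.0, 52.0, io.StringIO()),  # box B
]
for nid, lon, lat in [(1, 5.1, 51.5), (2, 0.0, 51.0), (3, 30.0, 10.0)]:
    for listener in listeners:              # fan out each entity once
        listener.handle_node(nid, lon, lat)
```

Note that this handles nodes cleanly; ways would still need some record of which node ids each box has already accepted.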

> If you're talking thousands of bounding boxes then a database may be the 
> way to go.  It might be quicker to load into a storage format that 
> allows you perform similar queries to the MySQL production DB.  The 
> production schema itself could be used or perhaps even something simpler 
> such as Berkeley DB.
I don't see how a database could be better than a plain file at storing 
a huge amount of raw, BLOB-like data. If the BLOBs don't fit into RAM, 
then any mechanism needs lots of I/O, I guess.
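For comparison, the simpler key-value store Brett mentions could look like the following, with Python's stdlib dbm standing in for Berkeley DB; the serialized node format is an illustrative assumption:

```python
import dbm
import os
import tempfile

# Store each node's raw data keyed by id, so later way processing does
# random-access lookups against the store instead of holding everything
# in the heap. dbm here is only a stand-in for Berkeley DB.
path = os.path.join(tempfile.mkdtemp(), "nodes.db")
with dbm.open(path, "c") as db:
    db[b"1"] = b"5.1,51.5"          # node id -> serialized "lon,lat"
    db[b"2"] = b"0.0,51.0"

with dbm.open(path, "r") as db:
    print(db[b"1"].decode())         # random-access lookup by node id
```

Whether this beats a plain file depends on the access pattern: sequential scans favour the flat file, random lookups favour the keyed store.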
