[OSM-dev] [OSM-talk] Osmosis error with multiple bounding boxes

Brett Henderson brett at bretth.com
Fri Oct 26 11:35:54 BST 2007

Lambertus wrote:
> Brett Henderson wrote:
>> Lambertus wrote:
>>> The current method of tee is indeed not truly scalable ;-) With 100 
>>> pipes and -Xmx2048m the processing is still in the clear, but with 
>>> 200 Osmosis quits with 'out of heap memory' after processing the 
>>> nodes (no ways exported yet).
>> That all depends on your definition of scalable, I'm impressed that 
>> it made it that far :-)  How many bounding boxes do you need to 
>> create?  If it's only in the 100s then invoking osmosis several times 
>> is likely to give you the best performance.  Otherwise we can look 
>> into persistence alternatives.
> I wrote the message with a winking smiley. I think 100 output pipes is 
> already pretty impressive. Currently I need about 200, but that's only 
> for a very small country (in size). It's likely that there are lots of 
> situations where 1000+ are needed. But even then it's still doable 
> with the current setup.
Okay, perhaps a better approach is warranted.
>> Having thought about it a bit more, my previous suggestion of 
>> persisting bit sets to disk may not be a great idea, the files are 
>> going to be very large and with random access patterns are going to 
>> cause your disk to spend all of its time seeking because we already 
>> know they're not going to be fully cached in RAM.  It could be worth 
>> a try because it should be relatively simple to implement with the 
>> IndexStore class, I'm just not optimistic about its performance.  A 
>> smarter index would be more appropriate that caters to the fact that 
>> the selected ids within a bounding box are going to be very sparse, 
>> in other words the number of ids within the box are small compared to 
>> those outside it.
>> The store and forward approach isn't likely to help much either, the 
>> java XML parser is likely to be faster than reading from temp files.  
>> I'm sure there are ways of speeding up the temp files further but 
>> it's not trivial.
> Without any knowledge about the inner workings of Osmosis, would a 
> 'write-through' implementation be possible (as the 
> planetosm-excerpt-area.pl script does)? The data is sent to all 
> 'listeners' (inPipes), the inPipes check if they are interested in the 
> data and write the data to the outFile (or pipe) if so. That would 
> eliminate all buffering and excessive memory usage.
The "if they are interested in the data" bit is the bit that consumes 
memory.  By default (ie. no completeWays or completeRelations flags) the 
osmosis --bounding-x tasks work as follows:
For every node, a check is performed to see if the node resides in the area.
If the node is not in the area, it is discarded.
If the node is in the area, it is passed to the destination and the node 
id is flagged in a BitSet.
For every way, a check is performed to see if one or more of its nodes 
are selected for inclusion; this requires lookups in the node BitSet.
If the way contains no nodes in the area, it is discarded.
If the way contains one or more nodes in the area, it is passed to the 
destination (and mangled to remove excluded nodes) and the way id is 
flagged in a BitSet.
For every relation, a check is performed to see if one or more of its 
members are selected for inclusion; this requires lookups in the node, 
way and relation BitSets.
If the relation contains no members in the area, it is discarded.
If the relation contains members in the area, it is passed to the 
destination (and mangled to remove excluded members) and the relation id 
is flagged in a BitSet.
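The node and way steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual osmosis code; the class and method names are invented, and node/way ids are narrowed to int purely to fit java.util.BitSet.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of the selection logic described above; names are
// illustrative and do not match the real osmosis task implementation.
class AreaFilterSketch {
    private final BitSet selectedNodes = new BitSet();
    private final BitSet selectedWays = new BitSet();
    private final double left, bottom, right, top;

    AreaFilterSketch(double left, double bottom, double right, double top) {
        this.left = left; this.bottom = bottom;
        this.right = right; this.top = top;
    }

    // A node is kept only if it lies inside the box; its id is remembered.
    // The BitSet grows with the maximum node id, which is where the memory goes.
    boolean processNode(long id, double lon, double lat) {
        boolean inside = lon >= left && lon <= right
                      && lat >= bottom && lat <= top;
        if (inside) {
            selectedNodes.set((int) id);
        }
        return inside;
    }

    // A way is kept if at least one of its nodes was selected; excluded
    // node references would then be stripped ("mangled") before output.
    boolean processWay(long id, List<Long> nodeIds) {
        boolean anyInside = false;
        for (long nodeId : nodeIds) {
            if (selectedNodes.get((int) nodeId)) {
                anyInside = true;
                break;
            }
        }
        if (anyInside) {
            selectedWays.set((int) id);
        }
        return anyInside;
    }
}
```

With a tee of N boxes there are N such filters, each holding its own BitSets, which is why memory grows linearly with the number of output pipes.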

Every time information must be stored in a BitSet to make a decision 
later on, memory is consumed.  Unless I'm missing something about how 
planetosm-excerpt-area works it must do something similar.  You can't 
avoid keeping track of included nodes because you need them to determine 
if a way is included.  This takes memory of some kind.

If you can create a more memory-efficient method of storing included 
nodes (ways and relations as well but nodes are the biggest memory 
consumer) then the memory usage will go down but you will always need 
some memory.  If the bounding boxes are small then a sorted list of 
selected node ids may be more efficient.  Perhaps this is the way to 
solve the problem.  An in-memory list of selected nodes will be somewhat 
proportional in size to the bounding box size which should allow you to 
scale to large numbers of small boxes more effectively.
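A minimal sketch of that idea: replace the BitSet with a sorted array of selected node ids, so memory is proportional to the number of selected nodes rather than the maximum node id. The class below is hypothetical, not osmosis code. It relies on the fact that in a planet file all nodes precede all ways, so every add happens before the first lookup and the array can be sorted lazily.

```java
import java.util.Arrays;

// Hypothetical alternative to a BitSet: a sorted array of selected node
// ids. Cost is ~8 bytes per selected node, so many small boxes stay cheap.
class SortedIdSet {
    private long[] ids = new long[16];
    private int size = 0;
    private boolean sorted = true;

    void add(long id) {
        if (size == ids.length) {
            ids = Arrays.copyOf(ids, size * 2);  // grow geometrically
        }
        ids[size++] = id;
        sorted = false;  // defer sorting until the first lookup
    }

    boolean contains(long id) {
        if (!sorted) {
            Arrays.sort(ids, 0, size);
            sorted = true;
        }
        return Arrays.binarySearch(ids, 0, size, id) >= 0;
    }
}
```

Lookups are O(log n) instead of the BitSet's O(1), but for small boxes n is small and the trade should favour the sorted list.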
>> If you're talking thousands of bounding boxes then a database may be 
>> the way to go.  It might be quicker to load into a storage format 
>> that allows you to perform similar queries to the MySQL production DB.  
>> The production schema itself could be used or perhaps even something 
>> simpler such as Berkeley DB.
> I don't see how a database could be better at storing a huge amount of 
> raw data (BLOB like) than a plain file. If the BLOBs don't fit into 
> RAM then any mechanism needs lots of I/O I guess.
The main difference is that databases are good at locating and indexing 
information as opposed to a BLOB which is difficult to access randomly.  
It will always be possible to write a more efficient mechanism yourself 
(in effect building your own database) but not without a lot of work.  
Use of a db will allow you to reduce memory usage almost to zero because 
you just query for bounding box information one at a time.  In this case 
it avoids the need to read a complete planet from start to finish (other 
than the initial import), you just read the data you're interested in.  
MyISAM tables are very fast to load, an entire planet could be loaded in 
several hours on relatively modest hardware.
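The "query one box at a time" idea might look something like the following. The table and column names loosely mirror the production MySQL schema but are assumptions for illustration, not the exact schema.

```java
// Hypothetical sketch: with nodes loaded into an indexed table, each
// bounding box becomes one range query instead of a full planet scan.
// Table/column names (current_nodes, latitude, longitude) are assumed.
class BoxQuery {
    static String nodesInBox(double left, double bottom,
                             double right, double top) {
        // An index covering (latitude, longitude) lets the database seek
        // straight to the box instead of scanning every node.
        return "SELECT id, latitude, longitude FROM current_nodes"
             + " WHERE latitude BETWEEN " + bottom + " AND " + top
             + " AND longitude BETWEEN " + left + " AND " + right;
    }
}
```

Run once per box, this keeps memory near zero regardless of how many boxes there are; the cost moves to the one-off import and the per-box query time.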

When I get a chance I'll look into modifying the bounding tasks to use a 
sorted list of ids and see how that compares for memory usage.