[osmosis-dev] Merge huge count of files

Igor Podolskiy igor.podolskiy at vwi-stuttgart.de
Fri Sep 30 18:21:22 BST 2011


Hi Rüdiger,

On 29.09.2011 15:28, Gubler, Ruediger wrote:
> I have to merge a huge count of files. Doing this in one osmosis call
> creates thousands of threads and stops the rest of the system working well.
> Is it possible and efficient to split the giant merge into smaller pieces?
> What is the best strategy to merge a huge count (e.g. 100x100 matrix)
> together with a minimum of needed memory?
> Must the whole dataset fit in the memory?
Memory isn't the problem with merges. The only thing worth mentioning 
that merge keeps in memory are its input buffers. Those are either very 
small (20 entities in the 0.39 release) or can be set to a more appropriate 
value on the command line (in current trunk, or HEAD now that it's in git ;)). 
Other than that, --merge just looks at the next entity on each input 
stream and chooses one of them to pass through downstream.
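
To illustrate why memory stays small, here is a minimal Python sketch of a
two-input streaming merge. It is not Osmosis code: the (type, id) tuples and
the name merge_two are made up for illustration, it assumes both inputs are
already sorted by type and then id, and it ignores the conflict resolution
that a real merge needs when the same entity shows up in both inputs.

```python
# Assumed sort order for this sketch: nodes, then ways, then relations.
TYPE_ORDER = {"node": 0, "way": 1, "relation": 2}


def sort_key(entity):
    """Sort key for a hypothetical (type, id) entity tuple."""
    kind, ident = entity
    return (TYPE_ORDER[kind], ident)


def merge_two(left, right, key=sort_key):
    """Merge two sorted streams, holding one pending entity per input."""
    left, right = iter(left), iter(right)
    done = object()  # sentinel marking an exhausted input
    a, b = next(left, done), next(right, done)
    while a is not done and b is not done:
        # Look at the next entity on each input and pass the smaller one on.
        if key(a) <= key(b):
            yield a
            a = next(left, done)
        else:
            yield b
            b = next(right, done)
    # One input is exhausted: drain whatever is left on the other.
    while a is not done:
        yield a
        a = next(left, done)
    while b is not done:
        yield b
        b = next(right, done)
```

However large the inputs are, only the two pending entities are held at
any time; buffers in front of the merge just reduce how often threads have
to hand over.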

I think the limit you're hitting is the number of threads - thousands of 
threads isn't healthy for a Java process (or any process for that matter, 
if we're talking "real", heavyweight threads).

Which merge strategy you choose shouldn't matter very much - just don't 
merge too many files in the same pass. Every reader needs a thread and 
every merge needs a thread. Since --merge takes two inputs, merging 8 
files at a time means 8 readers and 7 merges, so about 15 threads, which 
should be fine on a 4 core CPU. You would need a whole lot of passes with 
10000 files, though...
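
A rough Python sketch of how such batching could be planned. The file
names, the function names and the batch size of 8 are assumptions for
illustration only. The command line uses the --read-xml, --merge and
--write-xml tasks and assumes that each --merge combines the two most
recent streams, so n inputs need n - 1 merges.

```python
def plan_rounds(files, batch_size=8):
    """Group files into batches, round after round, until one file is left.

    Returns a list of rounds; each round is a list of (inputs, output)
    pairs. The intermediate output names are made up for this sketch.
    """
    rounds = []
    current = list(files)
    level = 0
    while len(current) > 1:
        batches = [current[i:i + batch_size]
                   for i in range(0, len(current), batch_size)]
        outputs = [f"round{level}_{i}.osm" for i in range(len(batches))]
        rounds.append(list(zip(batches, outputs)))
        current = outputs  # results of this round feed the next one
        level += 1
    return rounds


def osmosis_args(inputs, output):
    """Build one osmosis call that merges `inputs` into `output`."""
    args = ["osmosis"]
    for index, path in enumerate(inputs):
        args += ["--read-xml", f"file={path}"]
        if index > 0:
            args.append("--merge")  # merge this reader into the running result
    args += ["--write-xml", f"file={output}"]
    return args
```

For 10000 input files and batches of 8 this gives 5 rounds with 1250, 157,
20, 3 and 1 osmosis runs, 1431 runs in total. Each argument list can be
handed to subprocess.run one at a time, so only one small set of threads
is alive at any moment.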

Also, I would really recommend that you use the current HEAD (you can 
grab the newest build from the build server [1] if you don't want to 
compile it yourself) since the default input buffer size is way too small 
for current hardware. If your buffer size is set to 20, you spend a 
_lot_ of time switching between the reader threads and the merge thread.

Just another thought: if your XML files are guaranteed to be fully 
disjoint (no entity ever occurs in two different files; non-overlapping 
bounding boxes are generally _not_ enough), you could more or less just 
concatenate them (modulo XML header and such) and then sort the result. 
This should be equivalent to a merge. The concatenation would be very 
simple (and very fast) to do with any SAX parser/writer in whatever 
language, and for the sorting you can use Osmosis with --sort. But, again, 
your files absolutely need to be disjoint or else bad things will happen.
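
A minimal sketch of the concatenation step, using Python's xml.sax. The
names concatenate_osm and _BodyForwarder and the attributes written on the
new root element are assumptions for illustration. The sketch drops each
file's XML header and root element and forwards everything else unchanged
(including any per-file bounds elements, which you may want to filter out).
It does not check that the files are disjoint.

```python
import xml.sax
from xml.sax.saxutils import XMLGenerator


class _BodyForwarder(xml.sax.ContentHandler):
    """Forwards all SAX events below the root element to a shared writer."""

    def __init__(self, writer):
        super().__init__()
        self._writer = writer
        self._depth = 0

    def startElement(self, name, attrs):
        self._depth += 1
        if self._depth > 1:  # depth 1 is the per-file root, skip it
            self._writer.startElement(name, attrs)

    def endElement(self, name):
        if self._depth > 1:
            self._writer.endElement(name)
        self._depth -= 1

    def characters(self, content):
        if self._depth > 1:
            self._writer.characters(content)


def concatenate_osm(sources, out):
    """Write the contents of all sources under a single <osm> root."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    # Root attributes are placeholders chosen for this sketch.
    writer.startElement("osm", {"version": "0.6", "generator": "concat-sketch"})
    for source in sources:  # file names or open file objects
        xml.sax.parse(source, _BodyForwarder(writer))
    writer.endElement("osm")
    writer.endDocument()
```

The result could then be sorted with something like
osmosis --read-xml file=concatenated.osm --sort --write-xml file=sorted.osm
(file names made up).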

Hope that helps
Igor
