[osmosis-dev] Performance of IndexedObjectStore

Thu Dec 8 02:07:19 GMT 2011

Hi Dennis,

On 8 December 2011 00:40, Dennis Frostlander <dennis at enkelsoft.com> wrote:

> Hi,
>
> I am using the IndexedObjectStore for storing and then accessing large
> amounts of data - around 300 million objects.
> In the maps I am storing Longs as the keys and simple objects with a few
> properties as values.
> The maps are backed up in the file system by 5 files with size ranging
> from 3 GB to 11 GB.
>
> When I start accessing the data from the collections, I am experiencing
> quite slow performance - just to enumerate all objects in the collection it
> takes around 15 hours on the 7200 rpm hard drive, with 10G of memory
> available to java vm. The java vm runs in the server mode.
>

The IndexedObjectStore uses a very simple on-disk layout that tends to
result in very high levels of disk seeking.  It doesn't scale to large
datasets very effectively.

>
> I can see that the machine resources - CPU, hard drives are utilized to a
> very small amount, the respective performance counters are close to
> minimal.
>

Are you using Windows?  Which counters are you monitoring?  Disk throughput
will be minimal when the drive spends most of its time seeking.  I forget
the names of the counters off the top of my head, but you need to look for
counters like CPU Wait Time and Disk Queue Length.  CPU Wait Time is fairly
easy to understand: if the percentage is high, then your disk IO is the
bottleneck.

> I have tried to perform multi-threaded reads - in each thread I create
> separate indexed store readers. But the result is similar - the benefit is
> very small.
>

If disk seeking is the issue, more threads are unlikely to improve
performance and may in fact make it worse.

>
> Could anyone give me any suggestions how I can improve the data access and
> utilize the machine resources more efficiently?
>

There's no simple answer to this because it depends to a large extent on
your data access patterns.  About the only suggestion I have is to start
looking at using a proper database instead.  To get good performance out of
a database you need to ensure that data is organised according to your
access patterns.  One typically effective way to achieve this is to create
a data column using a PostGIS type, add a GIST index on that column, and
then cluster the table by that index.  That will organise the table
contents using the same ordering as your index which will have the effect
of grouping geographically close objects close together on disk.  Hope that
makes sense.
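
To make the index-then-cluster steps a bit more concrete, here is a rough
sketch of the statements involved, wrapped in a small Java helper so they can
be handed to whatever JDBC code you use.  The table name "nodes", the column
name "geom", the index naming scheme and the helper class itself are all
hypothetical; only CREATE INDEX ... USING GIST, CLUSTER ... USING and ANALYZE
are standard PostgreSQL statements.  It assumes the table and its PostGIS
geometry column already exist.

```java
import java.util.List;

// Hypothetical helper (not part of Osmosis): builds the PostgreSQL statements
// for "add a GIST index, then cluster the table by that index".
class ClusterDdlSketch {

    // Names are assumed to be trusted constants; nothing is quoted or validated.
    static List<String> statements(String table, String geomColumn) {
        String index = table + "_" + geomColumn + "_gist";
        return List.of(
            // Spatial index on the geometry column
            "CREATE INDEX " + index + " ON " + table + " USING GIST (" + geomColumn + ")",
            // Physically reorder the table rows to follow the index
            "CLUSTER " + table + " USING " + index,
            // Refresh planner statistics after the rewrite
            "ANALYZE " + table
        );
    }

    public static void main(String[] args) {
        // Example table and column names are made up for illustration.
        for (String sql : statements("nodes", "geom")) {
            System.out.println(sql + ";");
        }
    }
}
```

Note that CLUSTER is a one-off reordering that locks the table while it runs,
and rows added afterwards are not kept in clustered order, so it suits data
that is loaded once and then mostly read.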

>
> Yours sincerely,
> Dennis Frostlander,
>
> P.S. On a related topic, I noticed that when the Java process runs in
> debug mode with a debugger attached (either IntelliJ IDEA or Eclipse),
> the read operations are an order of magnitude slower. Not really sure why
> though...
>

Java debuggers add a lot of overhead to execution.  There's not much you
can do about it.  If you're trying to detect bottlenecks in code you need
to use a profiler and add probes targeted at specific code points, or rely
on logging.
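
As an illustration of a targeted probe, here is a minimal sketch of a
hand-rolled timer you could wrap around a suspect call and report through
your logging.  The TimingProbe class and the probe name "store.get" are
hypothetical, not something Osmosis provides, and a real profiler will give
you far more detail.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lightweight probe: accumulates call count and elapsed time
// for one specific code point.
class TimingProbe {
    private final String name;
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    TimingProbe(String name) {
        this.name = name;
    }

    // Call immediately before the code being measured.
    long start() {
        return System.nanoTime();
    }

    // Call immediately after, passing the value returned by start().
    void stop(long startNanos) {
        totalNanos.addAndGet(System.nanoTime() - startNanos);
        calls.incrementAndGet();
    }

    long calls() {
        return calls.get();
    }

    long totalNanos() {
        return totalNanos.get();
    }

    // One-line summary suitable for a log message.
    String summary() {
        long n = calls.get();
        long total = totalNanos.get();
        long avg = (n == 0) ? 0 : total / n;
        return name + ": calls=" + n + " totalMs=" + (total / 1_000_000) + " avgNanos=" + avg;
    }

    public static void main(String[] args) {
        TimingProbe probe = new TimingProbe("store.get");
        for (int i = 0; i < 1000; i++) {
            long t = probe.start();
            Math.sqrt(i); // stand-in for the real read being measured
            probe.stop(t);
        }
        System.out.println(probe.summary());
    }
}
```

Logging the summary every few million reads, rather than on every call, keeps
the probe's own overhead small.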

Brett